Conversation

@chraac chraac commented Jul 6, 2025

Performance Optimization for Quantization Operations

Overview

This PR introduces significant performance optimizations for quantized neural network operations in the hexagon-npu device backend, focusing on improved memory management, vectorized operations, and enhanced data type support.

Key Changes

Performance Optimizations

  • Optimized dot product implementations with mixed-precision support (F16×F32); a simplified scalar sketch follows this list
  • Improved VTCM cache utilization and reduced memory allocations
  • Enhanced matrix multiplication with better loop structure and prefetching
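
As a rough illustration of the mixed-precision path, here is a minimal scalar sketch of an F16×F32 dot product, assuming F16 weights are widened to F32 before the multiply-accumulate. It is not the HVX-vectorized code in this PR, and the function names are hypothetical.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical scalar reference for the mixed-precision dot product:
// F16 weights are widened to F32 and multiply-accumulated against F32
// activations. The PR's actual kernels do this with Hexagon HVX vectors.

// Minimal F16 -> F32 bit conversion (subnormals flushed to zero to keep
// the sketch short).
static float fp16_to_fp32(uint16_t h) {
    const uint32_t sign = (uint32_t)(h & 0x8000u) << 16;
    const uint32_t exp  = (h >> 10) & 0x1Fu;
    const uint32_t man  = h & 0x3FFu;
    uint32_t bits;
    if (exp == 0) {
        bits = sign;                                      // zero / subnormal
    } else if (exp == 31) {
        bits = sign | 0x7F800000u | (man << 13);          // Inf / NaN
    } else {
        bits = sign | ((exp + 112u) << 23) | (man << 13); // re-bias 15 -> 127
    }
    float f;
    std::memcpy(&f, &bits, sizeof(f));
    return f;
}

// Scalar F16 x F32 dot product: widen each weight, then multiply-accumulate.
float dot_f16_f32(const uint16_t * w, const float * x, std::size_t n) {
    float acc = 0.0f;
    for (std::size_t i = 0; i < n; ++i) {
        acc += fp16_to_fp32(w[i]) * x[i];
    }
    return acc;
}
```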

Quantization Improvements

  • Implemented dual/quad block processing for Q4_0 and Q8_0 operations (a single-block scalar sketch follows this list)
  • Added specialized aligned/unaligned code paths
  • Added configurable F16/F32 dequantization targets
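
To make the block processing concrete, below is a minimal scalar sketch of the standard ggml Q4_0 block layout and its dequantization to F32. The struct matches ggml's Q4_0 format; the function name is hypothetical, and the real kernels in this PR operate on two or four blocks per iteration with HVX intrinsics, keep separate aligned/unaligned paths, and can emit F16 instead of F32.

```cpp
#include <cstdint>

// Q4_0 block layout as used by ggml: one F16 scale plus 32 values packed
// as 4-bit quants in 16 bytes (low nibble = element j, high nibble =
// element j + 16), each stored with an offset of 8.
constexpr int QK4_0 = 32;

struct block_q4_0 {
    uint16_t d;              // scale, stored as F16
    uint8_t  qs[QK4_0 / 2];  // packed 4-bit quants
};

float fp16_to_fp32(uint16_t h);  // helper from the previous sketch

// Minimal scalar dequantization of a single block to F32.
void dequant_block_q4_0(const block_q4_0 * b, float * y) {
    const float d = fp16_to_fp32(b->d);
    for (int j = 0; j < QK4_0 / 2; ++j) {
        const int lo = (int)(b->qs[j] & 0x0F) - 8;  // elements 0..15
        const int hi = (int)(b->qs[j] >> 4)   - 8;  // elements 16..31
        y[j]             = lo * d;
        y[j + QK4_0 / 2] = hi * d;
    }
}
```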

Performance Impact

Performance benchmarks comparing the Hexagon NPU backend with the CPU backend across various operations show that the NPU/CPU ratio depends strongly on batch size and quantization format.

Matrix Multiplication Performance

| Operation | Dimensions | Hexagon NPU (GFLOPS) | CPU (GFLOPS) | NPU/CPU Ratio |
|---|---|---|---|---|
| MUL_MAT (q4_0) | n=1, k=14336 | 15.62 | 37.49 | 0.42x |
| MUL_MAT (q4_0) | n=4, k=14336 | 35.63 | 36.31 | 0.98x |
| MUL_MAT (q4_0) | n=8, k=14336 | 45.11 | 45.82 | 0.98x |
| MUL_MAT (q4_0) | n=512, k=14336 | 60.28 | 63.65 | 0.95x |
| MUL_MAT (q8_0) | n=1, k=14336 | 3.81 | 39.68 | 0.10x |
| MUL_MAT (q8_0) | n=512, k=14336 | 58.73 | 70.31 | 0.84x |
| MUL_MAT (f16) | n=1, k=14336 | 10.39 | 11.53 | 0.90x |
| MUL_MAT (f16) | n=512, k=14336 | 10.31 | 25.74 | 0.40x |
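
For context on the units: the GFLOPS values above follow the conventional 2·m·n·k FLOP count for a matrix multiply divided by measured wall time. The tiny sketch below shows that arithmetic with illustrative m and timing values (the actual shapes and timings are in the attached logs).

```cpp
#include <cstdio>

// Rough illustration of how the GFLOPS figures are derived: a MUL_MAT of
// an (m x k) matrix with a (k x n) matrix costs about 2*m*n*k floating
// point operations; dividing by the measured wall time gives FLOPS.
// The m and time values here are purely illustrative.
int main() {
    const double m = 4096.0;     // illustrative row count
    const double n = 512.0;      // batch size from the table
    const double k = 14336.0;    // inner dimension from the table
    const double seconds = 0.5;  // illustrative measured wall time
    const double gflops = 2.0 * m * n * k / seconds / 1e9;
    std::printf("%.2f GFLOPS\n", gflops);
    return 0;
}
```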

Attention Mechanism Performance

| Operation | Parameters | Hexagon NPU (GFLOPS) | CPU (GFLOPS) | NPU/CPU Ratio |
|---|---|---|---|---|
| FLASH_ATTN | hsk=64, nh=8, kv=4096 | 2.38 | 4.73 | 0.50x |
| FLASH_ATTN | hsk=128, nh=8, kv=4096 | 4.07 | 6.04 | 0.67x |
| FLASH_ATTN | hsk=128, nh=8, kv=16384 | 4.03 | 5.61 | 0.72x |

Elementary Operations

| Operation | Dimensions | Hexagon NPU (GB/s) | CPU (GB/s) | NPU/CPU Ratio |
|---|---|---|---|---|
| ADD | [4096,1,1,1] | 20.45 | 4.08 | 5.01x |
| ADD | [4096,512,1,1] | 25.68 | 19.14 | 1.34x |

Key Performance Insights

  1. Quantization Impact:
    • q4_0 consistently outperforms the other quantization methods on the NPU
    • At n=512, q4_0 (60.28 GFLOPS) slightly outperforms q8_0 (58.73 GFLOPS)
    • q4_K performs poorly compared to q4_0 across all batch sizes (see the attached logs)
  2. Relative to CPU:
    • The NPU excels at elementary vector operations (5x faster for small ADD operations)
    • The NPU reaches near-CPU throughput for matrix multiplication at large batch sizes
    • The CPU keeps an advantage for attention mechanisms across all tested configurations

test-backend-ops-perf-all.release.hexagon.51c53ae8f.log

test-backend-ops-perf-all.release.cpu.989772c7b.log

Unit tests

[hexagon-npu][ROPE][]supported, dst: f32[100x32x2], src0: f32[100x32x2], src1: i32[2], supported/unsupported: 1058/5194
[hexagon-npu][ROPE][]supported, dst: f32[100x32x2], src0: f32[100x32x2], src1: i32[2], supported/unsupported: 1059/5194
[hexagon-npu]Unsupported op: TRANSPOSE
[hexagon-npu][TRANSPOSE][ (reshaped) (transposed)]unsupported, dst: f32[2x3200], src0: f32[3200x2], supported/unsupported: 1059/5195
unload rpcmem lib successfully
  LLAMA(n_tokens=2): not supported [hexagon-npu] 
  6239/6239 tests passed
  Backend hexagon-npu: OK

Backend 2/2: CPU
  Skipping
2/2 backends passed
OK

8gen2-test-backend-ops-all.debug.hexagon.51c53ae8f

chraac added 27 commits June 30, 2025 14:45
@github-actions github-actions bot added the build and ggml labels Jul 6, 2025
@chraac chraac closed this Jul 6, 2025